An improved long short term memory network for intrusion detection

Over the years, intrusion detection systems have played a crucial role in network security by discovering attacks in network traffic and generating an alarm signal for the security team. Machine learning methods, e.g., Support Vector Machine and K Nearest Neighbour, have been used to build intrusion detection systems, but such systems still suffer from low accuracy and a high false alarm rate. Deep learning models (e.g., Long Short-Term Memory, LSTM) have been employed in designing intrusion detection systems to address this issue. However, LSTM needs a high number of iterations to achieve high performance. In this paper, a novel and improved version of the Long Short-Term Memory (ILSTM) algorithm was proposed. The ILSTM is based on the novel integration of the chaotic butterfly optimization algorithm (CBOA) and particle swarm optimization (PSO) to improve the accuracy of the LSTM algorithm. The ILSTM was then used to build an efficient intrusion detection system for binary and multi-class classification cases. The proposed algorithm has two phases: phase one involves training a conventional LSTM network to get initial weights, and phase two involves using the hybrid swarm algorithms, CBOA and PSO, to optimize the weights of the LSTM to improve the accuracy. The performance of ILSTM and the intrusion detection system was evaluated on two public datasets (NSL-KDD and LITNET-2020) using nine performance metrics. The results showed that the proposed ILSTM algorithm outperformed the original LSTM and other related deep-learning algorithms regarding accuracy and precision. The ILSTM achieved an accuracy of 93.09% and a precision of 96.86%, while LSTM gave an accuracy of 82.74% and a precision of 76.49%. Also, the ILSTM performed better than LSTM on both datasets. In addition, the statistical analysis showed that the improvement of ILSTM over LSTM is statistically significant. Further, the proposed ILSTM gave better results in the multi-class classification of intrusion types such as DoS, Probe, and U2R attacks.


Introduction
With the growth of the internet and the increasing use of technology in our daily lives, cybercrime has become a major concern for individuals, businesses, and governments alike. random weight initialization [23] and overfitting [24]. In other words, although LSTM has been used in many intrusion detection systems, it still suffers from two main limitations: (1) it takes a high number of iterations to find the best weight values of its network, which increases the computational cost, and (2) its classification performance is still not high. The objective of this paper is to minimize the number of iterations needed to find the best weight values of the LSTM network and to improve the classification performance in intrusion detection systems. To achieve this objective, an improved version of LSTM (i.e., ILSTM) was suggested. In the ILSTM, hybrid swarm algorithms, CBOA and PSO, were employed to optimize the LSTM weights while using fewer iterations. The ILSTM was then used to propose an efficient and accurate intrusion detection system for two cases: binary (normal or abnormal) and multi-class (classifying many attacks) classification.
The contributions of this work can be summarised as follows:
1. Proposing a novel and improved version of LSTM called ILSTM, in which hybrid swarm intelligence algorithms (i.e., CBOA and PSO) were employed to optimize the weights of the LSTM algorithm, which led to better performance using fewer iterations.
2. Building an efficient (i.e., fewer iterations) and accurate ILSTM-based intrusion detection system for binary (normal and abnormal) and multi-class classification (classifying more than one attack type, such as DoS, Probe, and U2R attacks).
3. Evaluating the performance of the new ILSTM and the intrusion detection system. A thorough evaluation was done using nine performance metrics (accuracy, detection rate, false alarm rate, precision, specificity, f-measure, false negative rate, Matthews correlation coefficient, and kappa coefficient) on two public datasets (NSL-KDD and LITNET-2020). The ILSTM performed better than LSTM on both datasets.
4. Comparing the results of the proposed solution with various deep learning algorithms. The comparison demonstrated that the proposed ILSTM gave better results in binary and multi-class classification of intrusion types such as DoS, Probe, and U2R attacks.
5. Conducting a statistical analysis using the Wilcoxon signed-rank test, which showed that the improvement of ILSTM over LSTM is statistically significant.
The subsequent sections of this paper are as follows: Section 2 discusses related work on swarm intelligence, deep learning, and network intrusion detection methods. Section 3 describes the algorithms that were used in the development of the proposed algorithm. Section 4 presents the proposed algorithm (ILSTM). Section 5 provides the experimental setup for implementing the proposed algorithm, including the parameter settings, performance metrics, and preprocessing of the NSL-KDD and LITNET-2020 datasets. Section 6 illustrates and discusses the performance of the proposed algorithm in binary and multi-class classification, as well as comparisons with other deep learning and machine learning algorithms. Section 7 presents the conclusion of this work and future work.

Related work
In [8], an integrated intrusion detection model based on a stacked denoising auto-encoder and a deep belief network (SADE-ELM and DBN-SoftMax) was developed to overcome the shortcomings of existing deep neural network models, including their long learning times and poor classification accuracy. However, the proposed model achieves only 76.64% accuracy in binary classification on the NSL-KDD dataset.
The authors of [13] developed an intrusion detection model based on bidirectional long short-term memory (BiDLSTM) and convolution LSTM, and the results show that BiDLSTM is more effective than convolution LSTM: the accuracy of convolution LSTM is 89.81%, while BiDLSTM reaches 94.26% in binary classification. Although BiDLSTM gives better results than convolution LSTM, it requires more training time than the other compared algorithms.
The authors of [25] proposed BAT-MC, a hybrid method combining BLSTM and an attention mechanism, and compared it to other machine learning algorithms (J48, Naive Bayes, NBTree, Random Forest, and SVM) using the NSL-KDD dataset in binary classification. The proposed method achieves 84.25% accuracy in binary classification, but its accuracy is lowest for U2R and R2L attacks in multi-class classification.
Jiang et al. [26] combined hybrid sampling techniques with a deep learning network (CNN) as a method for intrusion detection. They used one-side selection (OSS) to reduce the noise samples in the majority categories and the synthetic minority over-sampling technique (SMOTE) to increase the minority categories. The accuracy of this method is only 83.58% on the NSL-KDD dataset in binary classification and 82.74% in multi-class classification. Moreover, while this method has a high detection rate for U2R attacks, it has a lower detection rate for the other classes (normal, DoS, Probe, and R2L).
Choraś and Pawlicki [27] studied ANN hyperparameters (activation, optimizer, batch size, epochs, layers, and neurons) for an intrusion detection model using NSL-KDD and CICIDS 2017. With tanh activation, the Adam optimizer, a batch size of 100, 300 epochs, 1 layer, and 25 neurons, the accuracy was 99.9%. For other parameter values, accuracy dropped as low as 5.64%, demonstrating that the ANN model is sensitive to parameter values. Furthermore, they did not consider multi-class classification of intrusion types such as DoS, Probe, U2R, or R2L.
Multiple researchers have studied the use of swarm intelligence algorithms with machine learning algorithms. ELHasnony et al. [28] developed a hybrid swarm algorithm of BOA and PSO for selecting the best features. The selected features are passed to a machine learning algorithm (KNN) with 5-fold cross-validation for classification. Twenty-five datasets from the UCI machine learning repository and a COVID-19 dataset were used to evaluate the proposed algorithm, which gave better results than other swarm algorithms such as BOA, PSO, and GWO. ALsaleh et al. [29] investigated the impact of the salp swarm algorithm (SSA) for feature minimization on improving machine learning network-based anomaly detection classifiers such as XGBoost and Naive Bayes. An improved firefly algorithm was also proposed in [30] for optimizing the parameters of an XGBoost classifier for intrusion detection; the proposed algorithm was tested on the NSL-KDD and UNSW-NB15 datasets. The firefly algorithm reduced the number of features from 42 to 19. After feature selection, accuracy in binary classification increased, but other performance metrics, such as precision and f-score, decreased. In multi-class classification, most performance metrics gave their best results after selection.
The use of swarm algorithms for deep learning networks was also investigated by researchers. In [31], the hybrid deep learning model CNN-OLSTM is used to detect DDoS attacks and the grey wolf optimization method is used to choose the best features for detection, but the model obtains a very low specificity of 51%. In [11], a feature reduction model based on correlation and information gain was followed by an RNN classifier to detect attacks and non-attacks in the reduced-feature dataset, where 90% of the NSL-KDD dataset was used for training. The authors of [32] suggested using the whale algorithm to optimize the weights of LSTM networks, producing a model called WILS (whale integrated long short-term memory), to detect a variety of threats on IoT networks. They used the same dataset for training and testing, with 70% of NSL-KDD as training data and the remaining 30% as testing data in binary classification.
Some research papers use metaheuristic algorithms for optimizing the weights of LSTM, such as [33], which uses four different optimizers, namely harmony search (HS), the grey wolf optimizer (GWO), the sine cosine algorithm (SCA), and the ant lion optimization algorithm (ALOA), to train LSTM for maximizing classification accuracy.
The authors in [34] developed a model (OCNN-HMLSTM) that uses lion swarm optimization (LSO) to optimize the hyperparameters of a CNN (spatial features) and uses HMLSTM for learning temporal features. On NSL-KDD, the proposed model has a binary classification accuracy of 90%, while all attack types (DoS, U2R, Probe, and R2L) have higher false positive rates, reaching 9.92%. In [35], the authors proposed the firefly algorithm for feature selection on the NSL-KDD and KDD Cup 99 datasets, then used a DNN for the classification process. Despite the efficiency of the hybrid eFA-DNN framework, it is only proposed for binary classification.
The authors of [36] applied an evolutionary sparse convolution network (ESCNN) for identifying and tracking distributed denial of service (DDoS) attacks in the IoT. A variety of DDoS attack-related feature analyses were used to design the technique to reduce network overhead. The proposed network achieves a 98.28% detection rate and 99.29% accuracy in binary classification. In [37], a new feature selection strategy was proposed using the bio-inspired GWO algorithm; in addition, the authors applied the extreme learning machine (ELM) as the classification method. The modified GWO was tested using the UNSW-NB15 dataset and achieved 78% accuracy. In order to boost the accuracy of a machine learning classifier for intrusion detection systems, relevant features from the UNSW-NB15 and CICIDS-2017 datasets are selected using the artificial bee colony (ABC) algorithm, as described in [38]. According to [39], the firefly algorithm is also used in network intrusion detection to choose features. The firefly algorithm chooses 10 crucial features from the KDD Cup 99 dataset, which are passed to Bayesian network (BN) and C4.5 based classifiers for anomaly detection. Image recognition has also been applied lately in intrusion detection, as in [40], where a new approach was proposed using multistage deep learning image recognition that transforms network features into four-channel images (Red, Green, Blue, and Alpha) that are used in classification. Results reach 99.8% accuracy on the BOUN DDoS dataset.
From the above literature analysis, summarized in Table 1, it can be concluded that the performance of deep learning-based intrusion detection systems can still be improved. Such improvement should cover two aspects: binary and multi-class classification of attacks. It was also noticed that although LSTM has been used in many intrusion detection systems, such as [21,22,25], it still suffers from two main limitations: (1) it takes a high number of iterations to find the best weight values of its network, which increases the computational cost, and (2) its classification performance is still not high. In addition, LSTM performance is affected by random weight initialization [23].

Preliminary work
In this section, an overview of the algorithms used in our proposed algorithm and intrusion detection system is given.

Chaotic map
Over the last decade, chaotic maps have been widely appreciated in the field of optimization due to their dynamic behaviour, which helps optimization algorithms explore the search space more dynamically and globally [43]. Chaotic maps are ten mathematical functions that are used for the generation of chaotic sequences. In this paper, the iterative map developed in [44] is used instead of random sequences. It was tested in [20] and gave better results than other chaotic maps. It is defined as follows: Where $a \in (0, 1)$ and $\pi = 3.14$.
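As an illustration of how a chaotic sequence can stand in for a random number generator, the following minimal sketch assumes the commonly cited form of the iterative map, $x_{k+1} = \sin(a\pi / x_k)$. The function name, the default a = 0.7, the starting value, and the use of the absolute value to keep outputs in [0, 1] are assumptions made for this example and are not taken from the paper.

```python
import math


def iterative_map(x0: float, a: float = 0.7, n: int = 10) -> list[float]:
    """Return n chaotic numbers from the iterative map x_{k+1} = sin(a*pi/x_k).

    Assumptions (not from the paper): a = 0.7 by default, and the absolute
    value is taken so that every output lies in (0, 1], like a uniform draw.
    """
    if x0 == 0:
        raise ValueError("x0 must be non-zero")
    sequence = []
    x = x0
    for _ in range(n):
        x = abs(math.sin(a * math.pi / x))
        if x == 0.0:
            # Guard against a division by zero on the next step.
            x = 1e-12
        sequence.append(x)
    return sequence


if __name__ == "__main__":
    # The same starting value always reproduces the same sequence.
    print(iterative_map(0.3, n=5))
```

Unlike a pseudo-random generator, the sequence is fully determined by the starting value, so runs are reproducible without a seed.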

Butterfly Optimization Algorithm (BOA)
BOA is a swarm optimization algorithm that is inspired by nature and mimics the foraging behaviour of social butterflies [17]. BOA searches both locally and globally for the best solution to a given problem. In BOA, information is propagated to all other search agents (solutions) using fragrance to form a collaborative social network. All of these capabilities help in optimization and in searching for optimal parameters. In nature, butterflies use sensors to sense or smell fragrance. Each butterfly emits an amount of fragrance according to its fitness, and it emits fragrance with some intensity when it moves. The standard BOA is shown in Algorithm 1. The fragrance of each butterfly can be defined as follows.
Where $pf_i$ represents the perceived magnitude of fragrance and I is the fragrance intensity. The parameters a and c are the power exponent and the sensor modality, respectively. The power exponent a defines the variation of fragrance absorption, which affects the butterfly's ability to find the best solution. If a=1, there is no absorption of fragrance. That is, the other butterflies will sense all of the fragrance emitted by a particular butterfly. If a=0, the fragrance emitted by a particular butterfly is not perceivable by any other butterfly. Because a plays this role in optimization, we use the following equation developed in [28] to balance the BOA search capabilities.
Where $a_s$ and $a_f$ are the initial and final values of a, μ is the tuning parameter, and $T_{max}$ is the maximum number of iterations. The value of the sensor modality c lies in the range [0, 1]. Its value can be updated in the iterative BOA process as follows.
Where $T_{max}$ is the maximum number of iterations and the initial value of c is 0.01. Each butterfly emits fragrance when it moves, and the other butterflies are attracted to it according to the magnitude of its fragrance. This process is called global search and can be defined as follows.
Where $x_i^t$ is a vector that represents the butterfly (solution) at iteration t, g* is the overall best solution, r is a random number in [0, 1], and $f_i$ is the fragrance of the ith butterfly. When the butterflies fail to sense the fragrance of the other butterflies, they move randomly in the search space. This process is called local search and can be defined as follows.
Where $x_j^t$ and $x_k^t$ are two vectors that represent two different butterflies in the same population.
Algorithm 1 Butterfly optimization algorithm
1: Set the initial values of the population size n (butterflies), parameters a (power exponent), c (sensory modality), switch probability ρ, and the maximum number of iterations $Max_{itr}$.
2: Set t := 0.
4: Generate an initial population (butterflies) $x_i^t$ randomly.
5: Evaluate the fitness function of each butterfly (solution) $f(x_i^t)$.
▷ Local search.
17: end if
18: Evaluate the fitness function of each butterfly (solution) $f(x_i^t)$.
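The BOA building blocks described in this subsection can be sketched as follows. This is a minimal illustration that assumes the standard forms from the BOA literature: fragrance $pf = c I^a$, the sensor modality update $c \leftarrow c + 0.025 / (c \, T_{max})$, a global move $x + (r^2 g^* - x) f$, and a local move $x + (r^2 x_j - x_k) f$. The function names and the list-based representation are hypothetical, and the exact equations used in this paper may differ.

```python
import random


def fragrance(intensity: float, c: float, a: float) -> float:
    """Perceived fragrance pf = c * I^a (intensity is assumed non-negative)."""
    return c * intensity ** a


def update_sensor_modality(c: float, t_max: int) -> float:
    """Assumed standard BOA update: c <- c + 0.025 / (c * T_max)."""
    return c + 0.025 / (c * t_max)


def boa_move(i, population, best, pf_i, p, rng):
    """Return the new position of butterfly i for one BOA iteration.

    With probability p the butterfly moves towards the best solution
    (global search); otherwise it moves relative to two randomly chosen
    butterflies j and k (local search).
    """
    x = population[i]
    r = rng.random()
    if rng.random() < p:
        # Global search: x + (r^2 * g* - x) * f_i
        return [xd + (r * r * gd - xd) * pf_i for xd, gd in zip(x, best)]
    # Local search: x + (r^2 * x_j - x_k) * f_i
    j, k = rng.sample(range(len(population)), 2)
    return [xd + (r * r * jd - kd) * pf_i
            for xd, jd, kd in zip(x, population[j], population[k])]


if __name__ == "__main__":
    rng = random.Random(0)
    pop = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
    pf = fragrance(intensity=0.8, c=0.01, a=0.1)
    print(boa_move(1, pop, best=pop[0], pf_i=pf, p=0.6, rng=rng))
```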

Chaotic butterfly optimization algorithm (CBOA)
CBOA is a modified version of BOA that uses chaotic maps instead of random variables in Eqs 6 and 7 to update butterfly positions, thus enhancing BOA's accuracy, as described in [20]. For global search, Eq 7 can be changed as follows.
Where $x_i^t$ is a vector that represents the butterfly (solution) at iteration t, g* is the overall best solution, C is a chaotic number, and $f_i$ is the fragrance of the ith butterfly. For local search, Eq 6 can be updated as follows.
Where $x_j^t$ and $x_k^t$ are two vectors that represent two different butterflies in the same population.

Particle swarm optimization (PSO)
Kennedy and Eberhart proposed PSO as one of the bio-inspired algorithms in 1995 [45]. PSO is inspired by the social foraging behaviour of certain species, such as schooling in fish and flocking in birds. The standard PSO is shown in Algorithm 2. PSO consists of particles, each of which has its own velocity and position. In PSO, each particle moves towards its best local position Pbest and the best global position gbest, where Pbest is the best position found by the particle itself and gbest is the best among all the best local positions. Each particle has a velocity defined as follows.
Where i = 1, 2, ..., S, S is the swarm size, and $c_1$ and $c_2$ are the constant cognitive and social scaling factors. W is the inertia weight, which was added to boost performance [28]. W is calculated by the following equation.
Where $T_{max}$ is the maximum number of iterations and $T_i$ is the current iteration. $W_{max}$ and $W_{min}$ are the maximum and minimum values of the inertia weight, respectively. The location of the particle at iteration t is calculated as follows.
Calculate the fitness function f
6: Update personal best and global best of each particle
7: Update velocity of the particle using Eq 9
8: Update the position of the particle using Eq 11
9: end for
10: end while
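The velocity, inertia weight, and position updates described above can be sketched as follows. This is a minimal illustration that assumes the widely used linearly decreasing inertia weight, $W = W_{max} - (W_{max} - W_{min}) \, T_i / T_{max}$, and the standard PSO velocity rule. The function names and the default values of `w_max` and `w_min` are hypothetical and may differ from the settings used in this paper.

```python
import random


def inertia_weight(t: int, t_max: int,
                   w_max: float = 0.9, w_min: float = 0.2) -> float:
    """Linearly decreasing inertia weight (assumed form, illustrative defaults)."""
    return w_max - (w_max - w_min) * t / t_max


def pso_step(x, v, pbest, gbest, w, c1, c2, rng):
    """Apply one velocity and position update to a single particle.

    v_new = w*v + c1*r1*(pbest - x) + c2*r2*(gbest - x)
    x_new = x + v_new
    """
    new_v = []
    for xd, vd, pd, gd in zip(x, v, pbest, gbest):
        r1, r2 = rng.random(), rng.random()
        new_v.append(w * vd + c1 * r1 * (pd - xd) + c2 * r2 * (gd - xd))
    new_x = [xd + vd for xd, vd in zip(x, new_v)]
    return new_x, new_v


if __name__ == "__main__":
    rng = random.Random(1)
    w = inertia_weight(t=10, t_max=100)
    print(pso_step([0.0, 0.0], [0.1, -0.1], [0.5, 0.5], [1.0, 1.0],
                   w, c1=2.0, c2=2.0, rng=rng))
```

A large inertia weight early in the run favours exploration, while the smaller weight near $T_{max}$ lets particles settle around the best positions found.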

Long short term memory (LSTM)
LSTM is an extension of RNN that is able to learn long-term dependencies. The LSTM architecture is more complicated than the RNN architecture; it has four hidden layers that use gates to add and remove cell state information [46].
For one LSTM cell, at time step t, the forget, input, and output gates are represented by $f_t$, $i_t$, and $o_t$, respectively, as shown in Fig 1, which was discussed previously in [47]. The forget gate decides which information will be deleted from the cell state based on $h_{t-1}$ and $x_t$. The input gate determines which information from the current state will be stored in the cell state and updates it using the 'tanh layer' to generate a vector of new candidate values. The final output gate decides how the output should look and passes it through the 'tanh layer' to the next neuron. The following equations mathematically describe the relationship between the inputs and outputs at time t and t − 1: Where C denotes the cell state. The activation functions are the sigmoid function (σ) and tanh. x is the input vector, and $h_t$ is the output vector. The weight and bias parameters are represented by W and b, respectively. A tanh layer generates a vector of new candidate values, g, which can be added to the state.
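To make the gate interactions concrete, the following sketch implements one time step of an LSTM cell in plain Python. It assumes the textbook formulation in which each gate applies its weights to the concatenation $[h_{t-1}, x_t]$; the paper's exact notation may differ, and the parameter names (`W_f`, `b_f`, and so on) are hypothetical.

```python
import math


def _sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def _affine(weights, vector, bias):
    """Compute W @ v + b for a matrix stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step (textbook form, assumed for illustration).

    f_t = sigmoid(W_f [h_{t-1}, x_t] + b_f)   forget gate
    i_t = sigmoid(W_i [h_{t-1}, x_t] + b_i)   input gate
    g_t = tanh(W_g [h_{t-1}, x_t] + b_g)      candidate values
    o_t = sigmoid(W_o [h_{t-1}, x_t] + b_o)   output gate
    C_t = f_t * C_{t-1} + i_t * g_t
    h_t = o_t * tanh(C_t)
    """
    z = list(h_prev) + list(x_t)
    f_t = [_sigmoid(u) for u in _affine(params["W_f"], z, params["b_f"])]
    i_t = [_sigmoid(u) for u in _affine(params["W_i"], z, params["b_i"])]
    g_t = [math.tanh(u) for u in _affine(params["W_g"], z, params["b_g"])]
    o_t = [_sigmoid(u) for u in _affine(params["W_o"], z, params["b_o"])]
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, g_t)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


if __name__ == "__main__":
    # One hidden unit, two input features, all parameters set to zero.
    params = {k: [[0.0, 0.0, 0.0]] for k in ("W_f", "W_i", "W_g", "W_o")}
    params.update({k: [0.0] for k in ("b_f", "b_i", "b_g", "b_o")})
    print(lstm_step([1.0, 1.0], [0.0], [2.0], params))
```

With all parameters at zero, every gate outputs 0.5 and the candidate is 0, so the cell state is simply halved; this makes the role of the forget gate easy to see.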
In this paper, we develop a deeper LSTM network with four hidden layers in addition to the input and output layers. It starts by mapping inputs to their representations using the feature input layer. It then feeds the sequence to two LSTM layers. The LSTM outputs are then fed to two fully connected layers with the rectified linear unit (ReLU) as the activation function. Finally, the fully connected layers learn from and combine the features extracted by the LSTM layers to form a final output that passes through an output layer for classification. Fig 2 displays a summary of the LSTM network architecture with four hidden layers as the first phase of the proposed algorithm.

The proposed ILSTM algorithm
The proposed algorithm (ILSTM) consists of the LSTM network described in Section 3.5 and the hybrid swarm algorithm of CBOA and PSO, as outlined in Sections 3.3 and 3.4, respectively. The hybrid CBOA+PSO was used for optimising the weights of the LSTM network, which helps in improving the training of the LSTM network in a minimum number of iterations. In general, the proposed ILSTM consists of two main phases, as described in Fig 2. In the first phase, the LSTM is trained conventionally to get the best parameters and weights of its internal network architecture. In the second phase, the hybrid CBOA+PSO (see Algorithm 3) was used for optimising the weights of the trained LSTM network to find the optimal weights, which can improve the accuracy in both binary and multi-class classification while taking fewer iterations. More details are given in the following sub-sections and in the ILSTM algorithm given in Algorithm 3.

Phase 1: Training LSTM network
In order to obtain starting weights for phase 2 that are better than random weights, we first implemented a deeper LSTM network. The LSTM network was trained with four hidden layers: two LSTM layers (LSTM layer 1 + LSTM layer 2) followed by two fully connected layers (FCL) with the rectified linear unit (ReLU). The parameters of the LSTM network are described in Table 4. When the training accuracy of the LSTM network no longer shows improvement, the proposed algorithm uses phase 2 to improve performance in fewer iterations.

Phase 2: LSTM network optimization and acceleration
In this phase, by integrating the capabilities of the individual CBOA and PSO algorithms, we were able to combine their benefits for accurately optimizing the weights of an LSTM network. In this case, PSO is employed for the local search for optimal weights while CBOA is used for the global search for optimal weights. The following steps explain how both algorithms were used to optimize the weights of the LSTM network.

Generation of initial population
The proposed ILSTM algorithm starts with the weights obtained by the conventional LSTM in phase 1. Some parameters are used for CBOA, such as the switch probability (P), sensor modality (c), and power exponent (a), and other parameters are used for PSO, such as the minimum and maximum values of the inertia weight (Wmin, Wmax) and the constant cognitive and social factors (c1, c2), as well as the number of iterations (T) and population size (N) from Table 5. At each iteration, the values of the power exponent and sensor modality are updated based on the current iteration. CBOA and PSO are combined only in the position-updating step: CBOA is used for global search and PSO is used for local search.

Definition of fitness function
The fitness function of the proposed algorithm is the accuracy of ILSTM, which is to be maximized and is calculated using the ACC equation (Eq 20).

Updating weights of network
At each iteration, ILSTM updates the LSTM network with new weights, and the fragrance of each solution is calculated.

Position updating
Each solution in the population moves to its next position according to the value of the chaotic number C generated by Eq 1. If the value of C is greater than P, ILSTM uses the following equation for updating the position in local search.
Where $v_i^{t+1}$ is the velocity defined in Eq 9. If the value of C is less than P, ILSTM uses the following equation for updating the position in global search.
At each iteration, ILSTM selects the optimal solutions (weights) according to the maximum value of the fitness function (the maximum value of accuracy).

Termination condition
When the ILSTM algorithm reaches the maximum number of iterations, the optimal weights with the best fitness value are produced. Finally, an optimized ILSTM network with optimal weights is generated.
▷ Counter initialization.
4: for (i = 1 : i ≤ S) do
5: Evaluate the fitness function of each butterfly (weight) $f(x_i^t)$.
6: Calculate the fragrance for $x_i^t$ as shown in Eq 2.
7: Assign the overall best butterfly (weight) g*.
8: end for
9: repeat
10: Set t = t + 1.
11: for (i = 1 : i ≤ S) do
12: Generate chaotic number C by Eq 1
13: if (C < ρ) then
14: Move butterflies towards the best butterfly g* as shown in Eq 19.
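The phase 2 loop described in this section can be sketched on a toy problem as follows. This is a simplified illustration, not the authors' implementation: the fitness function is a stand-in for the ILSTM accuracy, the population is seeded by small random perturbations of the phase 1 weights, the per-iteration updates of the power exponent and sensor modality are omitted, and the chaotic map, the CBOA global move, and all default parameter values are assumptions.

```python
import math
import random


def chaotic_next(x: float, a: float = 0.7) -> float:
    """Assumed iterative map |sin(a*pi/x)|, kept strictly positive."""
    y = abs(math.sin(a * math.pi / x))
    return y if y > 0.0 else 1e-12


def toy_fitness(weights) -> float:
    """Stand-in for ILSTM accuracy: lies in (0, 1], maximal at all 0.5."""
    return 1.0 / (1.0 + sum((w - 0.5) ** 2 for w in weights))


def hybrid_cboa_pso(init_weights, fitness, n=10, t_max=50, p=0.6,
                    c=0.01, a=0.1, c1=2.0, c2=2.0,
                    w_max=0.9, w_min=0.2, seed=0):
    """Maximize fitness starting from init_weights (the phase 1 weights)."""
    rng = random.Random(seed)
    dim = len(init_weights)
    # Population: the phase 1 weights plus perturbed copies (an assumption).
    pop = [list(init_weights)] + [
        [w + rng.uniform(-0.1, 0.1) for w in init_weights]
        for _ in range(n - 1)]
    vel = [[0.0] * dim for _ in range(n)]
    pbest = [list(x) for x in pop]
    pbest_fit = [fitness(x) for x in pop]
    top = max(range(n), key=lambda k: pbest_fit[k])
    gbest, gbest_fit = list(pbest[top]), pbest_fit[top]
    chaos = 0.3
    for t in range(1, t_max + 1):
        w = w_max - (w_max - w_min) * t / t_max
        for i in range(n):
            chaos = chaotic_next(chaos)
            if chaos > p:
                # Local search with the PSO velocity and position update.
                vel[i] = [w * v + c1 * rng.random() * (pb - x)
                          + c2 * rng.random() * (gb - x)
                          for x, v, pb, gb in zip(pop[i], vel[i], pbest[i], gbest)]
                pop[i] = [x + v for x, v in zip(pop[i], vel[i])]
            else:
                # Global search with the chaotic BOA move towards g*.
                pf = c * fitness(pop[i]) ** a
                pop[i] = [x + (chaos * chaos * gb - x) * pf
                          for x, gb in zip(pop[i], gbest)]
            fit = fitness(pop[i])
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = list(pop[i]), fit
            if fit > gbest_fit:
                gbest, gbest_fit = list(pop[i]), fit
    return gbest, gbest_fit


if __name__ == "__main__":
    start = [0.0, 1.0, 0.2]
    best, best_fit = hybrid_cboa_pso(start, toy_fitness)
    print(toy_fitness(start), best_fit)
```

In the real setting, each fitness evaluation would load the candidate weights into the LSTM network and measure its accuracy, which is why reducing the number of iterations matters for the computational cost.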

Experimental setup
This section gives details about the experimental setup under which the experimental evaluations, in the next section, are conducted. Firstly, all experiments have been conducted on a laptop with an Intel(R) Core(TM) i5-6300U CPU@ 2.50 GHz and 8.00 GB of RAM and the proposed algorithms were implemented using Matlab R2020a running on Windows 10.
In Section 5.1, an overview of the performance metrics used to assess the quality of the proposed algorithm is given. The section then describes the two public datasets (NSL-KDD 2009 and LITNET-2020) used in the evaluation process and their preprocessing. Table 4 displays a summary of the LSTM network architecture. We compare the algorithm's performance with state-of-the-art machine learning and deep learning methods trained and tested on the same dataset (i.e., the NSL-KDD dataset). Finally, we test our proposed algorithm on a modern dataset, LITNET-2020, to ensure its efficiency.

Performance metrics
Nine performance metrics, namely accuracy (ACC), detection rate (DR), false alarm rate (FAR), precision (Prec), specificity (SPC), f-measure, false negative rate (FNR), Matthews correlation coefficient (MCC), and kappa coefficient, were selected to evaluate the performance of ILSTM [34]. All of these measures can be calculated from four basic counts: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). These four counts were collected from the confusion matrix [48].
1. Accuracy: the percentage of correctly classified instances to the total number of instances, defined as follows.
2. Recall (DR): equivalent to the true positive rate (TPR). It is the percentage of anomaly instances identified correctly over the total number of anomaly instances. It can be derived as follows.
6. FAR: also known as the false positive rate (FPR), the number of normal instances that are misclassified as anomalies divided by the total number of normal instances. It can be computed as follows.
7. FNR: can be computed as follows.
8. MCC: varies between −1 and 1, where the best binary classifier obtains +1 and the worst classifier obtains −1. It is computed as follows.
$$\mathrm{MCC} = \frac{(TP \times TN) - (FP \times FN)}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
9. Kappa coefficient: used to check whether the classifier can process imbalanced data classes successfully. It is calculated as follows.
where Absolute = Accuracy and values of A and B can be obtained as
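As a compact reference, the nine metrics listed in this section can be computed from the four confusion-matrix counts as in the sketch below. The function name is hypothetical, the formulas are the standard textbook definitions, and the kappa coefficient uses the usual Cohen's form $(p_o - p_e) / (1 - p_e)$ with the observed agreement $p_o$ equal to the accuracy; how this maps onto the A and B terms used in this paper is an assumption. No guard against empty classes (zero denominators) is included.

```python
import math


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard metrics from confusion-matrix counts (positive = attack)."""
    total = tp + fp + tn + fn
    acc = (tp + tn) / total
    dr = tp / (tp + fn)              # detection rate / recall / TPR
    prec = tp / (tp + fp)
    spc = tn / (tn + fp)             # specificity / TNR
    far = fp / (fp + tn)             # false alarm rate / FPR
    fnr = fn / (fn + tp)
    f_measure = 2 * prec * dr / (prec + dr)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Expected agreement by chance, from the row and column totals.
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total ** 2
    kappa = (acc - p_e) / (1 - p_e)
    return {"ACC": acc, "DR": dr, "Prec": prec, "SPC": spc, "FAR": far,
            "FNR": fnr, "F": f_measure, "MCC": mcc, "Kappa": kappa}


if __name__ == "__main__":
    for name, value in binary_metrics(tp=40, fp=10, tn=45, fn=5).items():
        print(f"{name}: {value:.4f}")
```

Note that FAR and SPC always sum to 1, as do DR and FNR, so each pair carries the same information from two points of view.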

Dataset 1: NSL-KDD dataset
The NSL-KDD dataset is a refined version of the KDD Cup dataset [49]. It has a fair distribution of all types of attacks [50]. Many researchers employ NSL-KDD to develop effective intrusion detection algorithms, such as in [26,34,50]. NSL-KDD includes 41 attributes, and each record is classified as normal or attack traffic [49]. NSL-KDD is divided into a training dataset (KDDTrain+) and two testing datasets, KDDTest+ and KDDTest-21. All of these datasets have normal records and four types of attack records: probe, remote to local (R2L), denial of service (DoS), and user to root (U2R). In this paper, all of the KDDTrain+ dataset is used for training, and both of the other datasets (i.e., KDDTest+ and KDDTest-21) are used for testing, where the training dataset represents 80% of the NSL-KDD dataset and the testing dataset represents 20%, as shown in Table 2.

Dataset preprocessing.
The KDDTrain+, KDDTest+, and KDDTest-21 datasets are preprocessed before being used for training and testing the LSTM network and the proposed ILSTM. We apply a preprocessing step to the raw dataset to make better use of domain knowledge of network traffic. It contains three processes, the first of which maps symbolic features to numeric values.
1. Data transformation: The NSL-KDD dataset has 38 numeric features and 3 non-numeric features: "protocol-type," "service," and "flag". As the LSTM classifier accepts only numeric values, we first convert the non-numeric features, as in [51,52], replacing every distinct value with an integer, as in Table 3.
One-hot encoding makes our training data more useful and expressive, and it can be rescaled easily. By using numeric values, we can more easily determine the probability of our values. In particular, one-hot encoding is used for our output values since it provides more nuanced predictions than single labels. Each value is converted to binary code, so a protocol type with three values (tcp, udp, and icmp) becomes 1, 2, and 3, which are recognised as [   and Prob so these imbalanced data cause a problem for classification, as the prediction of the majority class is increased while the detection of the minority class is very low. Previously, in [26], hybrid sampling was used, and the results were better than those on the original dataset. The synthetic minority over-sampling technique (SMOTE) is an over-sampling method [53]. SMOTE forms new minority class examples by interpolating between several minority class examples that lie close together. SMOTE can avoid overfitting and make minority class boundaries spread through majority class space. To balance the majority classes, the random under-sampling (RUS) [54] technique is used to reduce the number of examples of the majority classes in the training dataset.
3. Normalization: Some features in the NSL-KDD dataset, such as "duration," "src-bytes", and "dst-bytes", have a large gap between the minimum and maximum values, which can degrade the classification performance [55]. So, we applied the minimum-maximum normalization method [53], which maps features into the normalized range [0, 1]. This method can be defined as in Eq 32.
Where $X_{min}$ and $X_{max}$ are the minimum and the maximum values of feature x.
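The data transformation and normalization steps above can be sketched in plain Python as follows. This is a minimal illustration: the helper names are hypothetical, the integer codes follow the tcp/udp/icmp example in the text rather than Table 3, and min-max normalization is assumed to take the standard form $x' = (x - X_{min}) / (X_{max} - X_{min})$.

```python
def label_encode(values, categories):
    """Map each symbolic value to an integer code 1..K by category order."""
    codes = {name: k for k, name in enumerate(categories, start=1)}
    return [codes[v] for v in values]


def one_hot(code: int, n_classes: int) -> list[int]:
    """Turn an integer code in 1..n_classes into a one-hot vector."""
    return [1 if k == code else 0 for k in range(1, n_classes + 1)]


def min_max(column):
    """Rescale a numeric column into [0, 1]; a constant column maps to 0."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


if __name__ == "__main__":
    protocols = ["udp", "tcp", "icmp", "tcp"]
    codes = label_encode(protocols, ["tcp", "udp", "icmp"])
    print(codes)
    print([one_hot(c, 3) for c in codes])
    # Illustrative byte counts with a wide range, as for "src-bytes".
    print(min_max([0, 5000, 10000]))
```

In practice, the minimum and maximum should be taken from the training set and reused for the test sets so that both are scaled consistently.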

Parameter setting.
To determine the values of the parameters of the selected algorithms, we study the performance of the LSTM network on NSL-KDD. Then the hybrid algorithms (i.e., CBOA+PSO) were used to optimize the weights of the LSTM network, and finally we evaluate the performance of the proposed ILSTM algorithm in binary classification (normal, anomaly) and five-category classification (multi-class classification: normal, DoS, Probe, R2L, and U2R). The KDDTest+ dataset is used to determine the optimal parameters and network topology of the algorithm. These parameters and network topology are then applied to the KDDTest-21  Table 4, where:
1. The adaptive moment estimation (Adam) algorithm is used to update the LSTM network's parameters. For binary classification, the loss function was cross-entropy, while for multi-class classification the categorical cross-entropy was used. We applied regularization in the range [0.001, 0.01], which amounts to adding a cost to the loss function for large weights to ensure that our network does not overfit the data.
2. When the learning rate of the network is too high, the loss function of networks will oscillate without convergence. If the learning rate is too low, the slow convergence rate will hinder the updating of networks. Therefore, choosing an appropriate learning rate is very important for network performance optimization. As in Fig 5 we studied the impact of a set of learning rates [0.1, 0.01, 0.001, 0.0001] in binary and multi-classification on the KDDTest+ dataset and selected the best learning rate that achieves high accuracy.
3. An essential component of choosing the overall neural network architecture is determining the number of neurons in the hidden layers. Applying too few neurons in the hidden layers will result in a problem called underfitting. When too many neurons are used in the hidden layers, a problem known as overfitting occurs and training time is increased. In this paper, we assumed that the number of hidden neurons should be between the size of the input layer and the output layer in a network model, so we applied Eq 33 as in [56] to get the best values.
Where N_i is the number of input neurons, N_o is the number of output neurons, N_s is the number of samples in the training dataset, α is an arbitrary scaling factor, usually in the range [2, 10], and N_t is the number order for the hidden layer.
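To illustrate how the symbols above interact, the sketch below uses a widely quoted rule of thumb, N_h = N_s / (α (N_i + N_o)), as an upper bound on the number of hidden neurons. Whether Eq 33 has exactly this form is an assumption, and the sample sizes used here are hypothetical rather than the values used in the experiments.

```python
def hidden_neuron_bound(n_samples, n_inputs, n_outputs, alpha):
    """Rule-of-thumb bound: N_s / (alpha * (N_i + N_o)).

    This is an assumed form of the rule; a larger alpha gives a smaller,
    more conservative number of hidden neurons.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return n_samples / (alpha * (n_inputs + n_outputs))


# Hypothetical sizes: 100,000 training samples, 40 inputs, 2 outputs.
bounds = {
    alpha: hidden_neuron_bound(100_000, 40, 2, alpha)
    for alpha in range(2, 11)
}
```

Under this assumed rule, increasing the scaling factor from 2 to 10 reduces the suggested number of hidden neurons by a factor of five, which is one way to trade capacity against the risk of overfitting.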

Dataset 2: LITNET-2020 dataset
The LITNET-2020 dataset is a relatively new dataset collected from real-time network traffic on LITNET, the Lithuanian research and education academic network. It is a real-world, up-to-date, flow-based network dataset [57] developed to test intrusion detection systems. The dataset contains 85 network flow features and 12 attack types; a summary of the attacks and their instances is given in Table 6.

Dataset preprocessing.
By studying the LITNET-2020 dataset, it was found that many features, such as "fwd", "opkt", and "obyt", have only one unique value. Additionally, the dataset contains source and destination IP addresses and port numbers, which are distinct features and could not be used in attack detection. Therefore, only 16 features were available for attack classification. As a further pre-processing step, all categorical features were encoded using label encoding. It was also noticed that some features, such as "sp" and "dp", have a large gap between the minimum and maximum values, which can degrade the classification performance. So, we applied the minimum-maximum normalization method (Eq 32), which maps features into the normalized range [0, 1].
Further to that, we used SHAP analysis to explain the predictions of the proposed algorithm by calculating the contribution of each feature to the prediction, because SHAP analysis shows the importance of each feature for the target variable [58]. The results of the SHAP analysis are illustrated in Fig 6.

Dataset balancing using hybrid sampling.
The LITNET-2020 dataset suffers from an imbalanced class distribution, where the number of normal (benign) instances reaches 3/4 of the size of the dataset, as shown in Table 6. To address this problem, hybrid sampling, as described in point 2 of subsection 5.2.1, was applied to produce a balanced dataset.
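The idea of hybrid sampling can be sketched as follows: the majority class is randomly under-sampled, and the minority class is over-sampled with synthetic points placed between a minority example and one of its nearest minority neighbours. This is a simplified, SMOTE-like stand-in written in plain Python under assumed names and toy data; it is not the authors' implementation, and in practice a library such as imbalanced-learn would normally be used.

```python
import random


def random_under_sample(samples, target, rng):
    """Keep a random subset of `target` samples from a majority class."""
    if len(samples) <= target:
        return list(samples)
    return rng.sample(samples, target)


def smote_like_over_sample(samples, target, k, rng):
    """Grow a minority class to `target` samples by interpolation.

    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    """
    if len(samples) < 2:
        raise ValueError("need at least two minority samples")

    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    result = list(samples)
    while len(result) < target:
        base = rng.choice(samples)
        others = [s for s in samples if s is not base]
        neighbours = sorted(others, key=lambda s: squared_distance(base, s))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        result.append([b + gap * (o - b) for b, o in zip(base, other)])
    return result


# Toy two-feature data: 300 majority and 20 minority samples (hypothetical).
rng = random.Random(0)
majority = [[rng.random(), rng.random()] for _ in range(300)]
minority = [[5 + rng.random(), 5 + rng.random()] for _ in range(20)]

target = 100
balanced_majority = random_under_sample(majority, target, rng)
balanced_minority = smote_like_over_sample(minority, target, k=5, rng=rng)
```

After both steps each class has the same number of samples, and every synthetic minority point stays inside the region already occupied by real minority examples.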

Data splitting approach.
We divided the LITNET-2020 dataset into 60% for training and 40% for testing and validation. We chose this approach after conducting a small experiment to find the best data-splitting approach. The results of this experiment are summarized in Fig 7. As shown in this figure, we divided the LITNET-2020 dataset into four different training and testing splits. We then tested all of them and found that the 60:40 split gives the best results.
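The splitting experiment can be sketched as below. Only the 60:40 ratio is stated in the text; the other three ratios, the function name, and the stand-in data are assumptions made for illustration.

```python
import random


def split_dataset(rows, train_percent, seed=0):
    """Shuffle rows and split them into training and testing parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * train_percent // 100
    return shuffled[:cut], shuffled[cut:]


# Stand-in for dataset rows; real rows would be feature vectors with labels.
rows = list(range(1000))

# 60:40 is the split reported in the text; 70, 80, and 90 are hypothetical.
splits = {percent: split_dataset(rows, percent) for percent in (60, 70, 80, 90)}
train_60, test_40 = splits[60]
```

Each candidate split would then be used to train and evaluate the model, and the ratio with the best test performance would be retained.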

Results and discussion
This section reports and discusses the experimental results obtained on the two datasets described above. For each dataset, two main experiments are implemented to study the performance of the proposed algorithm, ILSTM. In the first experiment, the proposed algorithm is investigated for binary classification (i.e., normal or malicious traffic), while in the second experiment, ILSTM is evaluated for multi-class classification (i.e., to differentiate among normal, DoS, Prob, U2R, and R2L). Also, in each experiment, (1) a statistical analysis (Wilcoxon test) was performed to show the significance of the ILSTM algorithm, and (2) a comparison with other deep learning and machine learning methods was conducted to demonstrate the efficiency of the ILSTM algorithm.

Experiment 1: ILSTM performance for binary classification on NSL-KDD dataset
The aim of this experiment is to assess the performance of the proposed ILSTM for intrusion detection in the case of classifying network traffic into normal or abnormal (i.e., binary classification). This was done on KDDTest+ and KDDTest-21 datasets, as detailed below.

ILSTM Performance using KDDTest+ dataset.
To evaluate the performance of the proposed ILSTM, it is compared with the original LSTM and two optimized versions of LSTM using BOA and CBOA. A summary of the results of this experiment is given in Table 7. These results were averaged over ten runs on the KDDTest+ dataset. It is clear from this table that the proposed ILSTM algorithm gave the best results, achieving an accuracy of 91.31%, a specificity of 96.46%, and a FAR of 3.51% (a very important value for intrusion detection systems). Other best results are shown in bold text in this table. For detailed results of this experiment, the confusion matrix is reported in Fig 8. Another experiment was conducted on the KDDTest+ dataset to investigate the relationship between the accuracy and the number of iterations of the proposed ILSTM and the original LSTM. The results of this experiment are plotted in Fig 9. From this figure, it can be noticed that the ILSTM took fewer iterations than the LSTM and still achieved a higher accuracy. In this figure, two curves are represented as follows: (a) a conventional LSTM achieved an accuracy
The same experiment was conducted using the KDDTest-21 dataset. The aim is to compare the proposed ILSTM with the original LSTM, LSTM-BOA, and LSTM-CBOA. A summary of the results, averaged over ten runs, is given in Table 7. Also, the confusion matrix for all implemented algorithms in this experiment is shown in Fig 10. From these results, it can be concluded that the proposed ILSTM algorithm gave the best results, with an accuracy of 86%, a specificity of 99%, a precision of 97.9%, and a FAR of 0.07. The FAR is a very important value for intrusion detection systems: the smaller it is, the more efficient the IDS. Other best results are shown in bold text in this table.
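For readers who wish to reproduce such figures from a confusion matrix, the following sketch computes several of the metrics mentioned above, assuming their standard definitions (for example, FAR = FP / (FP + TN)). The counts are hypothetical and are not taken from Fig 8 or Fig 10.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute common IDS metrics from binary confusion-matrix counts."""
    precision = tp / (tp + fp)
    detection_rate = tp / (tp + fn)  # also called recall
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "detection_rate": detection_rate,
        "specificity": tn / (tn + fp),
        "far": fp / (fp + tn),  # false alarm rate
        "f1": 2 * precision * detection_rate / (precision + detection_rate),
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
    }


# Hypothetical counts: 100 attack and 100 normal records in the test set.
metrics = binary_metrics(tp=90, fp=5, tn=95, fn=10)
```

With these definitions, specificity and FAR computed from the same confusion matrix always sum to one, so a low FAR and a high specificity express the same property of the detector.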
Using the KDDTest-21 dataset, we also investigated the relationship between the accuracy and the number of iterations of the proposed ILSTM and the original LSTM. The results of this experiment are plotted in Fig 11. From this figure, it can be noticed that after applying the optimization phase using CBOA and PSO on LSTM, the accuracy improved by 18% between iterations 72 and 76 (see Fig 11B), while it remained constant at 68.95% for the LSTM without any optimization (see Fig 11A).
The mean squared error (MSE) is computed as in Eq 34:
$$MSE = \frac{1}{n}\sum_{i=1}^{n}(x_i - y_i)^2 \qquad (34)$$
where $x_i$ is the actual data, $y_i$ is the predicted data, and $n$ is the number of instances in the testing dataset. Fig 12 summarizes the results of the optimized weights of ILSTM. It can be seen that not only does the accuracy improve, but the lowest value of MSE is also reached in binary classification (normal or abnormal traffic) using the two datasets: KDDTest+ in Fig 12(a) and KDDTest-21 in Fig 12(b).
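A minimal sketch of the MSE computation over actual values x_i and predicted values y_i is given below, assuming the standard definition of the mean squared error. The sample labels and predictions are hypothetical.

```python
def mean_squared_error(actual, predicted):
    """Average of squared differences between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((x - y) ** 2 for x, y in zip(actual, predicted)) / len(actual)


# Hypothetical binary labels (1 = attack) and predicted probabilities.
actual = [1, 0, 1, 1]
predicted = [0.9, 0.2, 0.6, 1.0]
mse = mean_squared_error(actual, predicted)
```

A lower value indicates that the predictions are closer to the actual labels, which is why the MSE is used here alongside accuracy to judge the optimized weights.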

Wilcoxon signed-rank test for binary classification on KDDTest+ and KDDTest-21 datasets.
In this section, we apply the Wilcoxon signed-rank test to demonstrate the effectiveness of the proposed algorithm in binary classification using the two testing datasets. As shown in Table 8, the Wilcoxon statistics are computed between the accuracy of the conventional LSTM and the accuracy of the proposed algorithm (ILSTM), where the mean difference is -10.2 for KDDTest+ and -22.86 for KDDTest-21. The value of z is 2.8031 and the p-value is 0.00512 for both the KDDTest+ and KDDTest-21 datasets. Since p = 0.00512 is lower than the significance level (0.05), the null hypothesis should be rejected, which means that there is a significant difference between the proposed ILSTM algorithm and the conventional LSTM. This statistical analysis confirms the numerical results reported above.
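The test can be sketched with the normal approximation as follows. The accuracy values below are hypothetical and only mimic ten paired runs in which one algorithm is better in every run; the exact procedure used to produce the reported table is an assumption here. It is worth noting that, for ten pairs with no zero differences and all differences of the same sign, the rank sum W is 0 and the normal approximation gives |z| of about 2.803 regardless of the size of the differences, which is consistent with the same z being reported for both test sets.

```python
import math


def wilcoxon_signed_rank(first, second):
    """Return (W, z, p) for paired samples using the normal approximation.

    Zero differences are dropped, tied absolute differences share the
    average of their ranks, and no continuity correction is applied.
    """
    diffs = [a - b for a, b in zip(first, second) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        while (end + 1 < n
               and abs(diffs[order[end + 1]]) == abs(diffs[order[start]])):
            end += 1
        average_rank = (start + end) / 2 + 1  # ranks are 1-based
        for position in range(start, end + 1):
            ranks[order[position]] = average_rank
        start = end + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)

    mean = n * (n + 1) / 4
    std = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean) / std
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return w, z, p


# Hypothetical accuracies (%) from ten paired runs, for illustration only.
lstm_acc = [80.0, 81.5, 79.0, 82.0, 80.5, 81.0, 78.5, 82.5, 80.0, 81.0]
ilstm_acc = [90.5, 91.0, 90.0, 92.5, 91.5, 90.0, 89.5, 93.0, 91.0, 92.0]
w, z, p = wilcoxon_signed_rank(lstm_acc, ilstm_acc)
```

Because the resulting p-value is below 0.05, the null hypothesis of no difference between the paired accuracies would be rejected for this illustrative data.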
From Tables 7 and 8, three remarks can be made. Firstly, when a chaotic map is used with BOA, as in the CBOA algorithm, the results of all evaluation metrics improved. This is because a chaotic map widens the search space for new solutions while avoiding local minima. Secondly, when Eq 11 of the PSO algorithm is used in the local search instead of Eq 8, the search for new solutions improves because the velocity helps in searching for local and global best solutions. So, using CBOA and PSO in the proposed ILSTM algorithm improved the intrusion detection results in binary classification for most performance metrics (Acc, Spc, Prec, FAR, f1-score, MCC, and Kappa) on the KDDTest+ and KDDTest-21 datasets. Finally, the optimization process improved the LSTM network, giving better results in performance metrics and statistical tests than the conventional LSTM.

Comparison of the ILSTM algorithm with related methods.
In order to objectively evaluate the performance of ILSTM, we conducted a comparison with other deep and machine learning methods that were implemented in the intrusion detection literature. In this comparison, we used machine learning and deep learning methods reported in previous work such as [13,14,25]. The results of this comparison are reported in Fig 13 for machine learning methods and in Fig 14 for deep learning methods. From these figures, it can be concluded that our proposed algorithm outperformed all other algorithms in binary classification.
In addition, Table 9 shows the results of comparison with other methods suggested for binary classification. Those methods were used in [34]. From this table, it can be seen that the proposed ILSTM algorithm has achieved the best results in most measures, where the best values are shown in bold text.

Experiment 2: ILSTM performance for multi-class classification on NSL-KDD dataset
The aim of this experiment is to assess the performance of the proposed ILSTM for intrusion detection when classifying network traffic into different types of attacks (i.e., multi-classification), where there are five classes of data: Normal and four types of attacks (DoS, Prob, U2R, R2L). Three sub-experiments are conducted. The first and second are designed for the performance evaluation of ILSTM on the KDDTest+ and KDDTest-21 datasets, respectively, while comparing it with the most closely related work. The third experiment compares ILSTM with other related work under eight performance metrics.

ILSTM performance using the KDDTest+ dataset.
This experiment aims to study the performance of ILSTM in accurately identifying four types of attacks (DoS, Prob, U2R, and R2L). The results are summarized in Table 10. In addition, the confusion matrix for all implemented algorithms is shown in Fig 15. From these results, it can be noticed that ILSTM outperformed almost all other algorithms under all evaluation metrics. This is due to the integration of CBOA with PSO in the proposed ILSTM algorithm. Also, the ILSTM was compared with the LSTM in terms of the number of iterations needed to achieve the highest accuracy, and the results are plotted in Fig 16. From this figure, it can be seen that the proposed ILSTM algorithm achieved a higher accuracy with fewer iterations than the conventional LSTM: the LSTM reaches 79.72% with 100 iterations, as shown in Fig 16(a), whereas the ILSTM needs only 72 iterations to attain an accuracy of 88.17%, as shown in Fig 16(b). So, the proposed ILSTM can improve intrusion detection performance and also save on computational costs.

ILSTM performance using KDDTest-21.
To further evaluate the proposed ILSTM, we repeated the same experiment above but using a different dataset, namely KDDTest-21. The results of this experiment are summarized in Table 11. Also, confusion matrices of all compared algorithms in this experiment are plotted in Fig 17. From this table and the confusion matrix, it can be seen that the ILSTM achieved the best results when compared to other implemented algorithms.
Under the multi-classification scenario, we also investigated the relationship between the accuracy of ILSTM and its number of iterations in comparison with the original LSTM. The results of this experiment were plotted in Fig 18. From this figure, it can be noticed that the optimization of LSTM using CBOA and PSO can boost the accuracy by 20% in 10 iterations, while the accuracy of conventional LSTM remains constant starting from iteration 60 to iteration 100. It can be seen that the conventional LSTM gave an accuracy of 52.54% with 100 iterations as illustrated in Fig 18(a), while ILSTM achieved an accuracy of 76.73% in the same number of iterations as illustrated in Fig 18(b), where optimization process started just after the LSTM accuracy becomes constant, i.e., at iteration 60.

Efficiency of ILSTM algorithm in multi-class classification.
For a more thorough evaluation of the optimization of LSTM, the MSE was computed using Eq 34 in the context of multi-class classification. The results are summarized in Fig 19, which shows that optimizing the weights of a conventional LSTM network can enhance the multi-class accuracy (i.e., detecting different types of attacks) while also achieving the lowest MSE in multi-class classification for the KDDTest+ and KDDTest-21 datasets, where KDDTest+ in Fig 19(a) gave a lower MSE than KDDTest-21 in Fig 19(b).

Wilcoxon signed-rank test for multi-class classification on KDDTest+ and KDDTest-21 datasets.
To evaluate the significance of the ILSTM, we applied the Wilcoxon signed-rank test in multi-class classification using the two testing datasets, KDDTest+ and KDDTest-21. The Wilcoxon statistics for the accuracy of the proposed ILSTM algorithm and the conventional LSTM are shown in Table 12. From this table, it can be seen that the value of z is 2.8031 and the p-value is 0.00512 for both the KDDTest+ and KDDTest-21 datasets, and that the mean difference is -7.99 for KDDTest+ and -21.29 for KDDTest-21. Since p = 0.00512 is lower than the significance level (0.05), the null hypothesis should be rejected, which means that there is a significant difference between the proposed ILSTM algorithm and the conventional LSTM. This statistical analysis confirms the numerical results reported above.
In conclusion, the results given in Tables 10 and 11 show that ILSTM performed better than LSTM, LSTM-BOA, and LSTM-CBOA in terms of DR, Spec, FNR, and MCC for all classes (Normal, DoS, Prob, R2L, U2R). This means that using a hybrid optimization of CBOA and PSO helped in widening the search space for the best solutions and in finding globally optimal solutions on all testing datasets.

Comparison of the proposed ILSTM algorithm and the other related algorithms.
As in the case of binary classification, we also compared ILSTM with other published work on multi-classification of attacks. This comparison included deep and machine learning methods published in the intrusion detection literature [13,14,25]. As shown in Figs 20 and 21, the proposed ILSTM algorithm outperformed other existing machine learning and deep learning methods. From the comparison summarized in Table 13, it can be noticed that the proposed ILSTM algorithm achieved the lowest false alarm rate for all types of attacks. Also, the ILSTM reaches a higher DR, precision, and f-measure for most types of attacks; the best results are written in bold text. In comparison with other methods, ILSTM gave superior results for DoS, Prob, and R2L attacks in terms of recall, precision, F-score, and FAR. Additionally, ILSTM produces good results for the Normal class but weaker results for U2R attacks. This may be due to the limited number of instances of U2R and R2L attacks in the datasets. Hybrid sampling helps to mitigate this issue: it yields excellent results for R2L attacks and, for U2R attacks, the best FAR and an accuracy increased to 31.88, although ILSTM is still not the best in most metrics for U2R attacks.
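The per-class DR and FAR values discussed above can be derived from a multi-class confusion matrix in a one-vs-rest fashion. The sketch below assumes the standard definitions and uses a small hypothetical matrix with a rare class, not the matrices of Fig 15 or Fig 17.

```python
def per_class_rates(matrix, labels):
    """Compute one-vs-rest DR and FAR for each class.

    matrix[i][j] counts samples of true class i predicted as class j.
    """
    total = sum(sum(row) for row in matrix)
    rates = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        fn = sum(matrix[i]) - tp
        fp = sum(row[i] for row in matrix) - tp
        tn = total - tp - fn - fp
        rates[label] = {"DR": tp / (tp + fn), "FAR": fp / (fp + tn)}
    return rates


# Hypothetical counts with a rare third class (10 samples out of 210).
labels = ["Normal", "DoS", "U2R"]
matrix = [
    [90, 8, 2],   # true Normal
    [5, 94, 1],   # true DoS
    [6, 1, 3],    # true U2R
]
rates = per_class_rates(matrix, labels)
```

In this toy example the rare class has a very low FAR but also a low DR, which illustrates how a class with few instances can look good on FAR while remaining hard to detect.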

Experiment 3: ILSTM performance for binary classification on LITNET-2020 dataset
As with the NSL-KDD dataset, the ILSTM algorithm was evaluated on the LITNET-2020 dataset described earlier. The same nine performance metrics were used for evaluating the conventional LSTM and the proposed ILSTM algorithms. A summary of the results is given in Table 14. The confusion matrix for binary classification of LSTM before applying the optimization is given in Fig 22A, while the results of applying optimization (i.e., ILSTM) are given in Fig 22B. Also, Fig 23 shows how the accuracy increased after using the ILSTM, which starts the optimization process at iteration 58 to improve the accuracy of the original LSTM, which was constant at iteration 68. As shown in Fig 23B, ILSTM improved the accuracy from 92% to 94% at iteration 84 while the conventional LSTM gave an accuracy

Experiment 4: ILSTM performance for multi-class classification on LITNET-2020 dataset
In this experiment, the proposed ILSTM is also compared with the original LSTM for a multi-classification scenario. In this experiment, the nine performance metrics were employed in
Using the LITNET-2020 dataset, we also investigated the relationship between the accuracy of ILSTM and the required number of iterations and compared it with the original LSTM. The results of this experiment are plotted in Fig 26. This figure shows that the proposed ILSTM algorithm achieved a higher accuracy (95.77%) than the conventional LSTM (91.04%). The ILSTM reached its accuracy after 90 iterations, after which the accuracy became constant, while the LSTM reached its accuracy after 85 iterations, after which the accuracy became constant.

Conclusion and future work
In this paper, we developed an improved version of LSTM (called ILSTM) to improve the accuracy of LSTM-based intrusion detection systems. The ILSTM made use of a combination of two swarm optimisation algorithms, CBOA and PSO, to determine the best weights for the LSTM network. The ILSTM consists of two phases: one for training the deep LSTM network with the best parameters to get initial weights and another for optimizing these weights using CBOA and PSO. A comprehensive evaluation was conducted to assess the efficiency of the proposed ILSTM algorithm for intrusion detection systems. Two public datasets (NSL-KDD and LITNET-2020) and nine evaluation metrics were used. The results showed that the proposed ILSTM algorithm is better than the original LSTM and two optimized versions of it (i.e., LSTM-BOA and LSTM-CBOA) in two main cases: binary and multi-class classification. These results were also achieved using a small number of iterations and were supported by confusion matrices for all the implemented algorithms. Additionally, by comparing the proposed ILSTM algorithm with published and related machine and deep learning methods, it was found that the ILSTM yields superior results in terms of accuracy, detection rate, precision, and f-measure when tested on KDDTest+ and KDDTest-21. It was noticed that our proposed algorithm accomplished excellent results when applying optimization, but it needs more time to optimize the population on large datasets. So, in future work, we plan to apply optimization with a faster algorithm. The limitations of the study are as follows:
(1) optimization size and computational effort: the proposed algorithm can take a long time when applied to big data with millions of instances, as in the case of the LITNET-2020 dataset.
(2) computational resources: if the problem size is too large, it might not be possible to store the processing data in the memory of the computer running this algorithm.